1. About the project

1.1 Feeling right now

I think I drank too much tea while doing the first set of DataCamp exercises.

1.2 Hopes/expectations for the course

  • My original motivation for taking the course stems from the fact that although I have for a while used RStudio as my default tool for univariate statistics and the figures related to them, I have so far felt too intimidated to take it further into analyzing multivariate data. This is a shame, as I think in the long run it would be much more efficient (time-wise) to use R for all the statistics I do.
  • From a less-programming/more-content point of view, the data I usually handle (plant secondary metabolites) tends to be multivariate, and a more versatile set of multivariate methods in my statistical toolbox might yield important insights. Sadly, even though some of these methods were covered in the basic statistics courses I took, it has been a while since then, and I don’t feel confident enough to just start exploring and using the tools. So during the course I hope to develop a basic sense of
    1. how to use the analytical tools for multivariate data,
    2. what kind of questions about my data I can answer with them,
      and
    3. what are the worst caveats I should avoid as a beginner.
  • I have never used GitHub before, but I hope learning it won’t be too painful.

1.3 Where did I find out about the course?

I read about the course on UEF Yammer some time ago and got interested for (or despite) the above-mentioned reasons, so here we are.


link to my GitHub


2. Regression and model validation

2.1 Data

dat <- read.csv("D:/Opiskelu/MOOC-kurssi/IODS-project/data/lrn14_dat.csv") 
str(dat)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(dat$age)   # checking range of values on column 'age'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(dat$points)   # checking range of values on column 'points'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

The dataset summarizes questionnaire results from 166 students who completed the exam; for each student, gender (M=male, F=female), age (17-55 years), and exam result (points) are known. Students with 0 exam points were removed from the data; for the students who attempted the exam, results vary from 7 to 33 points. The questionnaire included questions measuring attitude and different learning strategies (deep, surface, and strategic learning). The corresponding summary variables deep, surf and stra have been brought to an equal scale by taking row means (i.e., by calculating, separately for each student, the mean of all questions measuring that specific learning strategy). Attitude has been brought to the same scale as the learning strategies by dividing the initial value by 10.

Gender is coded as a factor and the other variables as numeric.
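As a quick sketch of the scaling described above (this is not the actual wrangling script, and the item names d_q1-d_q3 are hypothetical stand-ins for the questions measuring a learning strategy):

```r
# Hypothetical illustration of the preprocessing described above.
# d_q1..d_q3 stand in for the questions measuring deep learning.
toy <- data.frame(d_q1 = c(4, 2), d_q2 = c(5, 3), d_q3 = c(3, 4),
                  Attitude = c(37, 21))
toy$deep     <- rowMeans(toy[, c("d_q1", "d_q2", "d_q3")]) # mean over the items, per student
toy$attitude <- toy$Attitude / 10                          # bring attitude to the 1-5 scale
toy$deep     # 4 3
toy$attitude # 3.7 2.1
```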

2.2 Graphical exploration of the data

It might be that the genders show different trends in the variables. To check for possible differences visually, we create subsets of the data by gender for easy plotting.

dat_males <- subset(dat,gender=="M")  
dat_females <- subset(dat,gender=="F")  

Distributions of numerical variables
Boxplots summarize data distributions nicely. However, if variables are on very different scales, automatic scaling may cause some of the plots to display poorly. So, we first check which of the numerical variables are in the same range and can thus be shown in the same plot.

summary(dat) # check scales of variables   
##  gender       age           attitude          deep            surf      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.417  
##          Median :22.00   Median :3.200   Median :3.667   Median :2.833  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :2.787  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.167  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :4.333  
##       stra           points     
##  Min.   :1.250   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:19.00  
##  Median :3.188   Median :23.00  
##  Mean   :3.121   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :33.00

Age (column 2) and points (column 7) vary approximately on the same range. The remaining variables (columns 3-6) also share a similar range.

par(mfrow=c(1,2)) # graphical setting telling R that I want 2 plots to appear side by side
boxplot(dat_males[,c(2,7)],main="males") # index variables [rows,columns] are used to specify that in this plot, we want to plot data in columns 2 and 7     
boxplot(dat_females[,c(2,7)],main="females")     

boxplot(dat_males[,c(3:6)],main="males")  
boxplot(dat_females[,c(3:6)],main="females")  

The distribution of age is skewed (most students are between 20 and 30 years old in both genders, but there are also several older students), but the other variables seem to at least resemble a normal distribution. From the boxplots, it looks like there might be a gender difference in attitude (the attitudes of males seem generally higher than those of females), but no other gender differences are obvious from these figures.

Relationships between numerical variables

pairs(dat[,-1],col=dat$gender) #scatterplot matrix of all numerical variables (excluding column 1, which is a categorical variable), points are colored according to gender  

With so many variables the plots are a bit small and potential trends are slightly difficult to see, but it looks like there might be positive linear associations between the variable pairs attitude vs. deep learning and attitude vs. points, and negative associations between the pairs age vs. surface learning, attitude vs. surface learning, and deep learning vs. surface learning. This would mean (if not caused by pure chance) that people with a positive attitude tend to achieve higher points in the exam and tend to aim for deep learning instead of surface learning. The youngest students rarely had high scores in deep learning, but this did not prevent some of them from reaching very good test scores.

2.3 Linear regression for exam results (points) with three explanatory variables

In order to select the variables most likely to explain exam points, let’s calculate the correlations between points and the other numerical variables.

cor(dat[,c(2:6)],dat$points)  
##                 [,1]
## age      -0.09319032
## attitude  0.43652453
## deep     -0.01014849
## surf     -0.14435642
## stra      0.14612247

Based on the printed table, three variables most strongly associated with points are
1. attitude
2. strategic learning
3. surface learning,
so we select these variables as initial explanatory variables.

Initial regression model

mod1 <- lm(points~attitude+surf+stra,data=dat)
summary(mod1)  
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model summary first shows some summary statistics of the model residuals (the differences between the model predictions and the actual values). In this case, the median is fairly close to zero and the minimum and maximum are of roughly similar magnitude. This gives us some idea of the distribution of the residuals (centered close to zero and not too skewed). We will need to check the assumptions of the final model more thoroughly, but so far this doesn’t look bad.

The next thing in the summary is the Coefficients table. Based on the explanatory variables (attitude, surf, and stra), R has fitted a linear model aiming to predict points as closely as possible (i.e., to minimize the residuals). The estimates essentially define this best-fitting model. The intercept gives the predicted number of points for an imaginary student whose attitude, surface learning and strategic learning all have a value of 0. The estimates for the other explanatory variables (which in this case are numeric) tell how much the predicted number of exam points would change if the variable in question increased by 1 (e.g., increasing attitude by 1 increases the predicted exam points by 3.4). Based on the unexplained variability in the data, R has also calculated a standard error for each estimate.

Each coefficient is followed by a statistical test (t and P values). These values come from a t-test of the null hypothesis that the estimate equals zero. The P-value (Pr(>|t|) in the Coefficients table) tells us the probability of getting an estimate at least this far from zero by pure chance if the variable in question does not actually affect exam results. So, if the P-value is very low (typically, below 0.05), the data is convincing enough that we reject the null hypothesis and accept the alternative hypothesis (that the estimate really is different from zero). In this case, we interpret the t-test results so that attitude has a positive effect on points, but the effects of surf and stra are not clear enough to be considered statistically significant.

Finally, the model summary provides the degrees of freedom (number of observations minus number of estimates) and the R-squared values, which measure model fit. In this case, R2=0.20 tells us that 20% of the variability in the data can be explained by the selected explanatory variables. Adjusted R-squared adds a penalty for including variables which do not contribute to the model fit (overfitting the model). The F-test at the end of the model summary relates to the null hypothesis that all slope coefficients (i.e., all estimates except the intercept) are zero.
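To make the connection between residuals and R-squared concrete, R-squared can be computed by hand as 1 - RSS/TSS; a minimal sketch with simulated data (not the course dataset):

```r
# R-squared by hand, verified against the value reported by summary().
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)         # simulated toy data, not the course data
fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)           # residual sum of squares (unexplained)
tss <- sum((y - mean(y))^2)        # total sum of squares (all variability)
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```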

Final regression model
Now, our initial model included two variables (surf and stra) whose effects on exam points were not statistically significant, so these variables can be left out of the final model.

mod <- lm(points~attitude,data=dat)

2.4 Linear regression for exam results (points)

summary(mod)
## 
## Call:
## lm(formula = points ~ attitude, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

From the summary of this model, we can see that the predicted values of exam points can be calculated as \[points=11.64+3.53*attitude\]
The estimates for both the intercept and attitude are highly significantly (P<0.001) different from zero, which means that, if the model assumptions are met, we can trust these estimates to reflect the situation in this data. For every 1-point increase in attitude we can expect the exam points of a student to increase by about 3.5.
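As a quick worked example of the equation above (using the rounded estimates, so the value is approximate), the predicted exam points for a student with attitude 4 would be:

```r
# Prediction from the fitted equation, using the rounded coefficients above.
intercept <- 11.64
slope     <- 3.53
intercept + slope * 4   # 25.76 predicted exam points
```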

The multiple R2 tells us that the variability in attitude explains 19% of the variability in exam points of students.

conclusion 1: Attitude has a positive effect on exam scores.

2.5 Model assumptions

Before we rush to write a ground-breaking manuscript about our conclusion 1, we want to check that the general assumptions for fitting linear models were met. Most of these assumptions can be graphically explored by drawing plots about the model residuals (difference between predicted and observed values).

par(mfrow=(c(1,2)))
plot(mod,which=1) #a diagnostic plot 'Residuals vs Fitted'
plot(dat$attitude,dat$points, xlab="attitude", ylab="points", main="Points vs attitude") #predicted vs. explanatory variable
abline(mod,col="blue") #adding a regression line (in blue) of the model to the figure. Residuals are the vertical distance of each point from this line.

First, we plot the residuals against the fitted values, and the target variable (points) against the explanatory variable (attitude). These plots tell us whether the assumed linear relationship between points and attitude was correct. This seems to be the case: there is no obvious pattern in the Residuals vs Fitted plot, which tells us that the model describes the data fairly well. Another point to note is that the variability of the residuals does not change much as the fitted value changes, which tells us that the data is fairly homoscedastic. The red line in the residual figure is a smoothed mean of the residuals, and it remains fairly close to zero; this is also a good thing, because the model assumptions expect the residuals to be centered around 0.

par(mfrow=(c(1,2)))
plot(mod,which=c(2,5)) #diagnostic plots 'Normal Q-Q' and 'Residuals vs Leverage'

The QQ-plot (Normal Q-Q) examines the distribution of the residuals by comparing their actual distribution (standardized residuals) to a normal distribution (theoretical quantiles). If the residuals follow the dashed line fairly well (as they do in this case), we can conclude that the residuals are approximately normally distributed.
Residuals vs Leverage explores whether there are data points with an unusually high impact on the model. Most of the points in this plot are fairly close to each other, and the scale of the x-axis does not indicate high leverage for any of the points, so there are no outliers with an anomalously strong impact on the model.

conclusion 2: Model assumptions are met fairly well, which means that we can trust the result summarized in conclusion 1.


3. Logistic regression

3.1-3.2 Dataset: alcohol use in secondary school students

alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt" ,sep=",",header = TRUE)
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data is a combination of questionnaire and school-collected data on 382 students from two secondary schools in Portugal (see Cortez & Silva, 2008). The dataset includes some basic background information on the students (e.g., gender and age), their family relations, school history, and grades (G1-G3, averages of the maths and Portuguese grades for the 1st, 2nd and 3rd periods). The students were also asked about the time and motivation they put into their studies, as well as some questions about their free-time use.

The alcohol consumption variable (alc_use) has been calculated as the average of the scores (1=very low, 5=very high) with which the students described their alcohol use on weekdays and on weekends. Students with an average alcohol use score above 2 are considered to have high alcohol use.
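A sketch of how the derived columns were presumably created from the weekday (Dalc) and weekend (Walc) scores, both of which appear in the column listing above (the actual wrangling script is not shown here):

```r
# Illustration with two invented students; Dalc/Walc are the weekday/weekend scores.
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))
toy$alc_use  <- rowMeans(toy[, c("Dalc", "Walc")])  # average alcohol use
toy$high_use <- toy$alc_use > 2                     # TRUE if the average exceeds 2
toy$high_use  # FALSE TRUE
```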

3.3 Aim of study and hypotheses

From a dataset described above, several interesting questions related to risk factors or consequences of high alcohol use (such as absences, failing a class, poor grades, parental support or general family situation) could be investigated. However, my interest lies more in risk-reducing aspects related to social support and aspirations of the secondary school students. In other words, I want to identify factors which reduce the chances that a student develops a high alcohol use.

Specifically, my aim is to study whether family relations (high values in famrel), tendency to go out with friends (high values in goout), being in a romantic relationship (binary variable romantic) or wish to take a higher education (binary variable higher) are related to high alcohol use among Portuguese secondary school students.

The predicted relationships between these variables are:

  • H1. Good family relations reduce the chances of developing a high alcohol use. In secondary school students, high alcohol use might be related to the rebelling against authorities. On the other hand, in students who will occasionally use alcohol, I would assume that a good relationship with parents enables parental intervention before the alcohol use in their kids increases.

  • H2. Tendency to go out with your friends will probably increase the occurrences of alcohol use, and thus increase the probability for high alcohol use. The underlying assumption is that among secondary school students, it is most common to use alcohol with your friends.

  • H3. Being in a romantic relationship reduces the probability of high alcohol use. In some students, the reason for going out to party (which would, in most cases, include alcohol) might be a wish to find someone. Being in a relationship might decrease the need to go searching or otherwise alter the way the students use their free time in directions which involve less alcohol.

  • H4. Wish to take higher studies reduces probability for high alcohol use, since the students with high ambition or career dreams might be more likely to spend more time on their studies, thus decreasing the time spent on parties.

3.4 Data exploration: family relations/study ambitions maybe linked to low alcohol consumption, and going out maybe linked to high alcohol consumption

Distributions of chosen variables
For graphical exploration using the functions gather() and glimpse(), let’s look at a subset of the original dataset. For data exploration purposes, we also include the variable alc_use, which was used for defining the target variable high_use.

variables <- c("famrel","goout","romantic","higher","alc_use","high_use")
alc_subset <- subset(alc,select = variables)

gather(alc_subset) %>% glimpse() #gather() reshapes the subsetted dataset (382 observations of 6 variables) into two columns, "key" (variable names) and "value", so the gathered dataset is a matrix of 2292 (=6x382) rows and 2 columns 
## Observations: 2,292
## Variables: 2
## $ key   <chr> "famrel", "famrel", "famrel", "famrel", "famrel", "famre...
## $ value <chr> "4", "5", "4", "3", "4", "5", "4", "4", "4", "5", "3", "...
gather(alc_subset) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + ggtitle("Variable distributions") + theme(plot.title = element_text(hjust=0.5))

From the bar plots of the numerical variables (on the first row), we can see that goout seems normally distributed but the distributions of alcohol use (alc_use) and family relationships (famrel) are skewed. Most of the secondary school students use very little alcohol and have fairly good relationships with their families.

From the binary variables (on the second row), we can see that most of the students do not have high alcohol use and are not in a romantic relationship, but the grand majority of them do hope to pursue higher education. This is maybe good news for the students, but it may indicate poor predictive power for the variable higher, unless every student not wishing to aim for higher education uses a lot of alcohol; see hypothesis 4.

Relationships between the target and predictor variables
To start exploring how the predictor variables might be related to (high) alcohol use, let’s tabulate the number of students and the mean alcohol use in each group of the categorical predictor variables.

alc_subset %>% group_by(romantic,higher) %>% summarise(count = n(), mean_famrel=mean(famrel), mean_goout=mean(goout), mean_alcuse = mean(alc_use))
## # A tibble: 4 x 6
## # Groups:   romantic [2]
##   romantic higher count mean_famrel mean_goout mean_alcuse
##   <fct>    <fct>  <int>       <dbl>      <dbl>       <dbl>
## 1 no       no         7        3.86       3.29        2.36
## 2 no       yes      254        3.99       3.10        1.84
## 3 yes      no        11        3.73       3.36        2.27
## 4 yes      yes      110        3.85       3.1         1.90

From this table, it looks like an aspiration for higher education decreases alcohol use: the mean alcohol use of the 18 students not hoping to continue their studies is nearly 0.5 units higher than that of the 364 students who said they would like to continue to higher education. Since the mean alcohol use in these groups falls on opposing sides of the limit for high use (2), this finding supports hypothesis 4. It might be noteworthy that the lower alcohol use in students who wish to go for higher education may be related to more intense studying (or fewer friends eager to go out), since the mean of goout is also lower in this group.

Unlike the aspiration for higher education, being in a romantic relationship does not seem to affect alcohol use, so judging by the means, hypothesis 3 is not supported by the data. People in a romantic relationship seem to have slightly lower family relationship scores (maybe parents have opinions on other things besides alcohol use in Portugal), but the difference in scores is really small.

Using means to summarize data with skewed distributions can be problematic, so let’s check the variability in alcohol use within each group by looking at the following boxplot. Relationship status (romantic) is shown with different colors and the limit between low/high alcohol use is shown by a dashed horizontal line.

g1 <- ggplot(alc_subset,aes(x=higher,y=alc_use,col=romantic)) 
g1 + geom_boxplot() + ylab("alcohol use") + xlab("wishes to continue studying after school") + ggtitle("Mean alcohol use") + theme_classic() + geom_hline(yintercept = 2,linetype="dashed",color="gray")

It looks like we can continue believing in hypothesis 4.

From this boxplot, it looks like it might be possible to find an interactive effect of romantic relationship and aspiration for higher studies on alcohol use - at least, if there is a reason to suspect the high alcohol use of the student who is in a relationship but does not want to continue studying after secondary school. If this student can be rejected as an outlier, it seems possible that there could be an effect of relationship status within the students without aspirations for higher education. However, if we don’t have a good reason to suspect exceptional circumstances or to change the hypothesis to exclude this outlier, it still does not seem that hypothesis 3 would be supported by the data.

Exploring the potential effect of the numerical variables might be easier if we tabulate the mean scores for students with and without high alcohol use.

alc_subset %>% group_by(high_use) %>% summarise(count = n(), mean_famrel=mean(famrel), mean_goout=mean(goout))
## # A tibble: 2 x 4
##   high_use count mean_famrel mean_goout
##   <lgl>    <int>       <dbl>      <dbl>
## 1 FALSE      270        4.01       2.86
## 2 TRUE       112        3.76       3.73

From this table, it looks like high alcohol use is related to lower score in family relationships and higher going-out rate (supporting hypotheses 1 and 2). Let’s try checking also this graphically:

g2 <- ggplot(alc_subset,aes(x=high_use,y=famrel)) + geom_boxplot() + ylab("quality of family relations") + xlab("high alcohol use") + ggtitle("Family relations") + theme_classic()
g3 <- ggplot(alc_subset,aes(x=high_use,y=goout)) + geom_boxplot() + ylab("going out with friends") + xlab("high alcohol use") + ggtitle("Going out with friends") + theme_classic()
grid.arrange(g2,g3,ncol=2)

The boxes (depicting the majority of observations) do not overlap in either figure, so the distribution of variables famrel and goout in students with low/high alcohol use continue to support hypotheses 1 and 2 as described above.

3.5 Logistic regression model: good family relations decrease and going out with friends increases the probability for high alcohol use

Let us fit a logistic regression model m1 to study whether high alcohol use is statistically related to family relations, going out with friends, being in a romantic relationship or aspirations for higher education. After fitting the model, we print out its summary.

m1 <- glm(high_use ~ famrel + goout + romantic + higher, data=alc, family="binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + romantic + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6697  -0.7554  -0.5397   0.9551   2.4742  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8306     0.8243  -1.008   0.3136    
## famrel       -0.4306     0.1339  -3.216   0.0013 ** 
## goout         0.7952     0.1189   6.690 2.23e-11 ***
## romanticyes  -0.3143     0.2735  -1.149   0.2505    
## higheryes    -0.9408     0.5701  -1.650   0.0989 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 398.81  on 377  degrees of freedom
## AIC: 408.81
## 
## Number of Fisher Scoring iterations: 4

The model summary shows a reminder of the model structure and some information on the residuals (at the top), and the deviance (a measure of goodness of fit) at the bottom. For interpreting the results, the important part is the Coefficients table between them. Since the non-numerical factor variables are binary, we do not need to worry about contrasts: each variable has only two classes to compare.

Interpretation of summary of m1

  • From the estimates of the coefficients, we see that the variable famrel and answering yes to the questions about romantic and higher are negatively related to high alcohol use (estimate < 0), and the variable goout is positively related to high alcohol use (estimate > 0). These estimates would seem to support our hypotheses 1-4, but when we look at the P-values, only the effects of family relations (P=0.001) and going out with friends (P<0.001) are statistically significant. The very low P-values associated with these variables tell us that if family relations or going out with friends did not truly affect the probability of high alcohol use, it would be highly unlikely to get a dataset like this.

  • The P-value related to romantic is not statistically significant (P=0.251>0.05), meaning that even if students in a romantic relationship seemed to have less tendency towards high alcohol use in this dataset, there is so much variability that the result is not clear. From this data, we cannot conclude that being in a relationship would affect the probability of high alcohol use – at least not without collecting more data.

  • The P-value associated with aspirations for higher education is between 0.05 and 0.1. This range is sometimes called marginally significant: there is a pattern in the data, but the variability is still a bit too high to conclude that the pattern was not caused by chance.

To calculate the odds ratios (OR), we take the exponent of each estimate and of the upper and lower limits of the confidence intervals.

OR <- exp(coef(m1))
CI <- exp(confint(m1))
## Waiting for profiling to be done...
cbind(OR,CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.4357791 0.08335217 2.1406938
## famrel      0.6501495 0.49849143 0.8440722
## goout       2.2148780 1.76606242 2.8171861
## romanticyes 0.7302921 0.42266511 1.2387525
## higheryes   0.3902971 0.12597479 1.2041032

Interpretation of odds ratios from m1
The odds ratio is the ratio of the odds of success between conditions. In this case, “success” means that a student has high alcohol use, and “failure” that they do not. The conditions are defined by the explanatory variables we put into the model. So, for example, the OR for higheryes (meaning that the student answered “yes” to the question about higher education) tells us that a student who wishes to continue studying after secondary school is only about 40% as likely to have high alcohol use as a student who doesn’t want to continue studying (or, conversely, a student who doesn’t want to continue studying is about 2.5 times as likely to have high alcohol use, since 100%/40% = 2.5). For the numerical variables (famrel and goout), the odds ratio tells how much an increase of one score unit in the variable multiplies the odds that the student has high alcohol use.
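To illustrate what exponentiating a logistic-regression coefficient does, here is a minimal simulation (unrelated to the course data): each one-unit increase in the predictor multiplies the odds of “success” by exp(coefficient).

```r
# Simulated binary outcome with a true log-odds slope of 0.8 for x.
set.seed(42)
x <- rnorm(500)
p <- plogis(-0.5 + 0.8 * x)        # true probability of success
y <- rbinom(500, 1, p)
fit  <- glm(y ~ x, family = "binomial")
or_x <- exp(coef(fit)["x"])        # odds ratio per one-unit increase in x
or_x                               # should lie in the vicinity of exp(0.8)
```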

If we combine the results of this table with those of the model summary, we should stick to interpreting the ORs of the statistically significant variables (for the non-significant romanticyes and higheryes, the confidence interval includes the value 1, which means that based on this model we cannot really say whether being in a relationship or wishing to continue studies increases or decreases the probability of high alcohol use).

Conclusions on our hypotheses based on the logistic regression model
Hypotheses 1 and 2 are supported by the data and the logistic regression; hypotheses 3 and 4 are not.

  • H1: Good family relations decrease the probability that secondary school students have high alcohol use (P=0.001). An improvement of 1 in the “family relations” score decreases the odds of high alcohol consumption to 50-84% (95% CI) of the odds of students with worse family relations. (Or: a one-unit decrease in family relations increases the odds of high alcohol use by a factor of about 1.5 (100/65=1.54).)

  • H2: Going out with friends increases the probability of high alcohol consumption in secondary school students (P<0.001). An increase of 1 in the “going out” score increases the odds that a student consumes a lot of alcohol about 1.8-2.8-fold (95% CI) compared to students who go out less.

  • H3: There is no evidence that being in a romantic relationship affects the chance of having high alcohol use (P=0.25).

  • H4: There is not sufficient evidence to conclude that hoping to continue to higher studies would have an effect on high alcohol use in secondary school students (P=0.099).

3.6 The model correctly predicts the alcohol use status of 76% of the students

Comment on results, compare model performance with simple-guessing performance

Here, we were asked to use only the statistically significant predictors (the most parsimonious model available), so we will fit a new model m1_simple from which the non-significant predictors have been dropped.

m1_simple <- glm(high_use ~ famrel + goout, data=alc, family="binomial")
AIC(m1,m1_simple)
##           df      AIC
## m1         5 408.8108
## m1_simple  3 408.3245

From the table above, we can note that the AIC value of m1_simple (408.32) is only slightly lower than that of the original model m1 (408.81), so dropping these variables did not improve the model fit by much, but let’s still continue with the simpler model.

We will calculate predicted probabilities, add them to the original dataset, and create a new column prediction for checking whether the model would have predicted the specific student to have high alcohol use. Then, we will cross-tabulate the results to see how well the predictions fitted the actual data.

probabilities <- predict(m1_simple, type = "response") 
alc <- mutate(alc, probability = probabilities) 
alc <- mutate(alc, prediction = probability>0.5) 
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   27
##    TRUE     64   48

From the table above, the cases where the prediction was correct are on the diagonal (FALSE predicted as FALSE, or TRUE predicted as TRUE), so we can calculate the training error as error = (64+27)/382 = 0.238 ≈ 24%. Specifically, we can see that 7% of students who did not use a lot of alcohol were incorrectly predicted to have high alcohol use (27/382 = 0.071), and that nearly 17% of the students who did use a lot of alcohol were not found by the model (64/382 = 0.168).
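The training error computed above is simply the share of observations where the prediction disagrees with the truth; the same quantity can be wrapped into a small loss function (a sketch on toy vectors, not the course data):

```r
# Mean prediction error: proportion of cases where the predicted
# probability falls on the wrong side of 0.5.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

actual    <- c(1, 1, 0, 0, 0)            # toy true classes
predicted <- c(0.9, 0.2, 0.1, 0.6, 0.3)  # toy predicted probabilities
loss_func(actual, predicted)             # 2 of 5 predictions are wrong
```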

The model correctly predicted high alcohol use in 76% of secondary school students. It is possible that a better-fitting model could be found by exploring the data further – since the error rate was higher among students using a lot of alcohol, adding some variable highly correlated with alcohol use might improve the fit and predictive power of the model.

Based on the initial guesses about the relationships between the selected explanatory variables famrel, goout, romantic and higher, two out of four hypotheses were shown to be correct, one was incorrect (romantic did not have the guessed effect on high alcohol use, which - after the analysis - feels like the most far-fetched of the hypotheses), and one (the effect of study ambitions on high alcohol use) was neither proved nor disproved. So, crudely comparing the guessed hypotheses (2/4 correct, one incorrect) to the training error of the model, I would say that the error rates of the guessed and modelled connections in the data are about the same, but the model had a better ‘success rate’. (The validity of this comparison can - and should - be questioned, since I doubt the errors can actually be seen to be on the same scale.) Personally, I think both approaches are needed: scientists making educated guesses to form the hypotheses, which should then be assessed and quantified using the available statistical tools.


4. Clustering and classification

4.1-4.2 Dataset: Boston data (from MASS-package)

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Dataset Boston contains information about housing values in the suburbs of Boston. The dataset describes 506 towns (rows) around Boston with 14 variables, including the median value of owner-occupied homes (medv). The numerical variables contain estimates of e.g. crime rate (crim), pollution (nitrogen oxide concentration, nox), average number of rooms per dwelling (rm) and school size (pupil-teacher ratio, ptratio). There are also two non-numerical variables (the binary chas, indicating location by the riverside, and the ordered index rad, describing accessibility to the radial highways).

4.3 Exploration of Boston-data

4.3a Variable distributions

Let’s start by looking at the variable distributions using boxplots, in combination with the gather() function, which stacks the columns of the dataset Boston (a 506 x 14 data frame) into a 7084 x 2 data frame, where the first column contains the names of the variables in Boston (key) and the second the values of those variables (value) in each town area around Boston. We also modify the ggplot() code from last week’s exercises to show boxplots instead of histograms.

dim(gather(Boston))
## [1] 7084    2
gather(Boston) %>% ggplot(aes(key,value)) + facet_wrap("key", scales = "free") + geom_boxplot() + ggtitle("Variable distributions") + theme(plot.title = element_text(hjust=0.5)) #geom_boxplot() defines that the desired type is boxplot, and we need to change the first argument into aes(x,y) to define that we want to show distribution of values in each variable (key)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the boxplots, we can see that the variables are on very different scales, the distributions of most variables are moderately to highly skewed, and the variance also varies a lot. For example, it looks like the majority of town areas are located relatively close to a radial highway (rad) and have a fairly low property tax (measured by tax). The summary() command shows the same information as the boxplots, in numerical form.

It might be interesting to see if the distributions of tax or rad (with very wide interquartile ranges) actually consist of two populations. Let’s draw histograms!

par(mfrow=c(1,2))
hist(Boston$tax)
hist(Boston$rad)

Bingo! Besides the bulk of areas with lower tax values, there are some areas with extremely high taxes. A similar bimodal distribution is visible in the accessibility to radial highways (rad). To be thorough, let’s replace the initial boxplots with violin plots, which show multimodal distributions more clearly.

gather(Boston) %>% ggplot(aes(key,value)) + facet_wrap("key", scales = "free") + geom_violin(draw_quantiles = c(0.5)) + ggtitle("Variable distributions") + theme(plot.title = element_text(hjust=0.5)) + geom_point(position = position_jitter(0.1),alpha=0.01) #changed geom_boxplot into geom_violin (with draw_quantiles specifying that I want to draw the median), and added geom_point to show the points on top of the distributions (position_jitter spreads the points to make them more visible, and alpha adds transparency so that the points don't steal the show from the distributions)

4.3b Relationships between variables

To see which variables could potentially be related, we will calculate the correlations between all variables, save them into cor_matrix, and use this to draw a correlation plot with corrplot(). In the following graph, the size and intensity of the circles correspond to the correlation between variables: blue circles mean positive correlations, red circles mean negative correlations.

cor_matrix <- cor(Boston)
corrplot(cor_matrix,method = "circle",type="upper",cl.pos = "b",tl.pos = "d",tl.cex = 0.6)

From the plot, it looks like the industrialization level of a zone (indus) is positively related to nitrogen oxide levels and negatively to the distance from employment centers, and accessibility to radial highways is strongly related to property tax values, for example. Additionally, the proportion of old buildings (age) seems to be higher in areas close to the employment centers (negative association with dis). The value of homes seems to be negatively associated with multiple variables (lots of red circles in the last column), but positively related to the mean number of rooms (rm, perhaps not a big surprise).

4.4 Modifications of the dataset

4.4a Standardization

Different absolute values and variances would make the variables contribute unequally to distance-based multivariate methods, so to treat all variables equally, we will center each variable around its mean and scale it to unit variance. Note that standardization changes the location and scale of each variable, but not the shape of its distribution.
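What scale() does, column by column, is the familiar z-score transformation (x − mean)/sd. A one-variable sketch on made-up numbers:

```r
# scale() subtracts the column mean and divides by the column sd.
x <- c(2, 4, 6, 8)
by_hand   <- (x - mean(x)) / sd(x)
via_scale <- as.numeric(scale(x))

# After scaling, the mean is 0 and the sd is 1.
c(mean = mean(via_scale), sd = sd(via_scale))
```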

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

Now the mean of each variable is zero and the ranges of the variables (shown by the distance from min to max, now within roughly -5 to 10 for all variables) are much more similar than in the original Boston data (where the ranges could be as different as 0.39–0.87 vs. 187–711).

4.4b Change numerical crim into a categorical crime

To create four crime-classes of equal size, we will use quantiles to define whether a town can be defined to have “low”, “medium low”, “medium high” or “high” crime, based on (standardized) per capita crime rate crim.

bins <- quantile(boston_scaled$crim) #vector defining the cutting values for different categories of crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high")) #create the new variable crime, with the selected category names
boston_scaled <- dplyr::select(boston_scaled, -crim) #remove original crim from data
boston_scaled <- data.frame(boston_scaled, crime) #add new crime into data
4.4c Create training and testing datasets

We are planning to do linear discriminant analysis (LDA), a multivariate classification method, later. In order to test how well the classification works, we will split the dataset randomly into a training set (containing 80% of the towns) and a test set (containing the remaining 20% of the towns).

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8) #choose randomly 80% of rows (these will be in the training data set)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,] #testing dataset consists of all rows not selected for the training

4.5 Linear discriminant analysis for crime rate

In order to find out which variables are typically related to each of the crime classes, we will do linear discriminant analysis (LDA) on crime with other 13 variables as predictors, using the training dataset we created above. Besides fitting the LDA-model, we will show the results in a biplot.

lda.fit <- lda(crime~., data = train) #this creates the LDA-model

classes <- as.numeric(train$crime) #to avoid having to worry about factor levels or such, we will create a numeric copy (classes) of the categorical "crime" variable
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)} #function drawing arrows that will be added to the plot above to make it into a biplot

plot(lda.fit, dimen = 2,col=classes,pch=classes,cex=0.5,main="LDA for crime rate") # plotting the cases (towns) based on their distances in two dimensions. The color already separates different classes, so I reduced the symbol size (using the argument cex)
lda.arrows(lda.fit, myscale = 1.3) # add arrows (labeled with the variables they represent) making the distance plot into a biplot

From this plot, we see that based on all non-crime variables, town areas with high crime can be quite clearly separated from areas with less crime, although there are a few “med_high crime” areas resembling “high crime” areas (green points in the cluster of blue) and one blue “high crime” point clustering with the majority of green “med_high crime” points. The long rad arrow tells us that the variable most clearly separating “high crime” areas from the other classes is their accessibility to the radial highways.
The other crime classes are less clearly separated from each other based on the non-crime variables, but their order from low to medium-high (black–red–green) seems logical. The other arrows are shorter and lie on top of each other, suggesting that their discriminating power is lower, but it looks like high values of zn (related to zoning large areas for residential use) could be somewhat related to a lower crime rate in the area.

4.6 Does the LDA fitted above predict crime correctly?

Based on all the non-crime variables in the test dataset, we will predict the crime classes of the towns, and cross-tabulate these predictions with the actual crime classes.

correct_classes <- test$crime # save the correct classes from test data
test <- dplyr::select(test, -crime) # remove the crime variable from test data

lda.pred <- predict(lda.fit, newdata = test) 
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      11        0    0
##   med_low    2      13        7    0
##   med_high   0       6       18    0
##   high       0       0        0   30

If the prediction of crime level for the towns around Boston were perfect, everything but the top-left–bottom-right diagonal of this table would be zeros. Now, we can see that lda.fit predicted all towns with high crime correctly in the test data, but within the lower-than-high crime areas the predictive power of the model was lower. (The exact numbers depend on which rows were randomly selected from the original dataset, so they change every time the code is run.) The difficulty of predicting the towns with lower-than-high crime was already visible in the LDA plot based on the training dataset (above): towns with low, med_low and med_high crime (black, red and green symbols, respectively) did not show up as distinct clusters in the plot.
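The cross-tabulation above can be condensed into a single accuracy number by dividing the sum of its diagonal by the table total. A sketch using the counts printed above (which, as noted, change between runs):

```r
# Confusion matrix as printed above: rows = correct class, columns = predicted.
tab <- matrix(c(15,  2,  0,  0,    # predicted "low"
                11, 13,  6,  0,    # predicted "med_low"
                 0,  7, 18,  0,    # predicted "med_high"
                 0,  0,  0, 30),   # predicted "high"
              nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(tab)) / sum(tab)  # share of correct predictions
accuracy
```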

4.7 Clustering of town areas around Boston

In principle I think we could use the same dataset we loaded before (since we didn’t make any direct changes to it), but just in case: let’s reload and standardize the data.

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...

For clustering, we will calculate the Euclidean distances (the smaller the distance, the more similar the towns) between each pair of the 506 towns. Using the standardized dataset means that variables with the highest initial variability (like tax, ranging from 187 to 711 in the original data) do not get a much larger weight than variables measured on a smaller scale (like nitrogen oxide concentration, ranging from 0.39 to 0.87).
These distances are used in k-means clustering, which we will first run with 3 clusters.
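As a reminder, the Euclidean distance that dist() computes is the square root of the summed squared differences over all variables. A two-variable sketch with made-up standardized values:

```r
# Two toy "towns" described by two standardized variables.
towns <- rbind(a = c(0.5, -1.0),
               b = c(-0.5, 1.0))

d_hand <- sqrt(sum((towns["a", ] - towns["b", ])^2))  # sqrt(1^2 + 2^2)
d_dist <- as.numeric(dist(towns))                     # same value via dist()
```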

dist_eu <- dist(boston_scaled) # calculate the Euclidean distances and save them into a new object (kmeans() below computes these distances internally, so this object is mainly for illustration)
km <-kmeans(boston_scaled, centers = 3) #kmeans algorithm

The algorithm kmeans() now looks for the optimal way to cluster the data into three groups (defined by the argument centers = 3). It does this by trying out different cluster centroids and, for each candidate solution, calculating the Euclidean distance of every observation to its cluster centroid. In the optimal solution, the clusters are “located” so that the distances of the observations to their cluster centroids are minimal.
The problem is that the algorithm doesn’t know whether some other number of clusters would allow an even better way to group the data, so we will explore this by calculating the total within-cluster sum of squares (TWCSS, a kind of multivariate variance associated with each solution) for solutions with 1-30 clusters and plotting them to see where the TWCSS drops.

set.seed(1988) #unless specified otherwise, kmeans() selects random starting locations for the clusters; this could in principle affect the plot, so we fix the random seed
k_max <- 30 #max number of clusters - if the plot doesn't seem to stabilize we could set it higher
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss}) #create a vector containing the total WCSS for each 1-30 cluster solution of kmeans
qplot(x = 1:k_max, y = twcss, geom = 'line') #plot total WCSS against number of clusters

This case does not seem as clear as the DataCamp example, since the curve doesn’t become “bumpy” or really start to level off before k=5, but the most abrupt drop of the curve ends somewhere between 2 and 3, so we will do the clustering with k=3. (A number as high as 5 could perhaps be argued for, but to keep the interpretation simpler, we will take the more parsimonious solution.)
We will run the kmeans() again with this number of clusters and plot the solution to see how our data was grouped.

km_3 <-kmeans(boston_scaled, centers = 3) 
pairs(boston_scaled[1:5],col=km_3$cluster)

pairs(boston_scaled[6:10],col=km_3$cluster)

pairs(boston_scaled[11:14],col=km_3$cluster)

  • Dividing the data into 3 clusters shows that high crime rates are mostly found in the cluster marked with black. Other variables typical of this cluster seem to be high values of tax and rad, for example. (So, a high crime rate in a town might be correlated with high-value properties, perhaps describing wealth, and with either high or low accessibility to the highways, depending on what accessibility index 1 means.)

  • The red cluster seems to be related to low tax, high dis, low nox or indus, and high medv. This could support the interpretation that these are residential areas, from which people commute to industrial/commercial centers.

  • The green cluster consists of towns by the river (chas = 1), but it is tricky to see other similarities in these towns, except potentially the pupil-to-teacher ratio (ptratio)

A potentially noteworthy thing is that the red and black clusters seem to separate fairly well when the value of homes (measured by medv) is plotted against lstat, which is described as “lower status of the population”. Without knowing how this variable is defined (or going into the problems of defining some people as having lower status than others), it looks like the clustering we did relates to the higher- and lower-valued areas around Boston, and visual inspection of the data suggests that the population with higher/lower status is unevenly distributed across these areas. This could represent a case of regional (economic, societal) segregation. An alternative explanation relates to the zoning of different towns into either commercial/industrial (black) or residential (red) areas.

4.8 Bonus: which variables affect the 5-cluster k-means grouping of the Boston data the most?

Let’s do k-means clustering with 5 clusters on the standardized Boston data.

km_5 <-kmeans(boston_scaled, centers = 5) # clustering of Boston into 5 categories
str(km_5$cluster) # named integer vector with 506 elements, containing the cluster index of each town
##  Named int [1:506] 2 5 2 2 2 5 5 5 5 5 ...
##  - attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
boston_scaled$cluster <- as.factor(km_5$cluster) # add the cluster indexes (coded as a factor) into the 'boston_scaled' data frame (making it 506 x 15)

From the visual inspection above, we can get an idea of the variables that tend to have high (e.g. rm, rooms per dwelling, in red) or low (e.g. crim in green) values within each cluster. To unambiguously explore which variables are behind this particular clustering, we can use linear discriminant analysis to determine which (linear combinations of) variables affect these five clusters the most. Since we are not planning to use this LDA model for predictions, there is no need to create separate training and test sets.

lda_5clusters <- lda(cluster~., data = boston_scaled) #LDA-model with 'cluster' as the grouping variable and all other variables (can use str() to check that they are numerical) as predictors
lda_5clusters$means #prints the variable means within each cluster
##         crim         zn      indus        chas        nox         rm
## 1  3.0022987 -0.4872402  1.0149946 -0.27232907  1.0593345 -1.3064650
## 2 -0.3981339  1.2930469 -0.9902994 -0.01202492 -0.8283387  1.0566896
## 3  0.5173303 -0.4872402  1.0149946  0.03346513  1.0036872 -0.1109215
## 4 -0.2753323 -0.4872402  1.5337294  0.16512651  1.1273809 -0.6003284
## 5 -0.3884901 -0.3308141 -0.4873088 -0.01513156 -0.4761310 -0.2318056
##          age        dis        rad         tax    ptratio      black
## 1  0.9805356 -1.0484716  1.6596029  1.52941294  0.8057784 -1.1906614
## 2 -0.9121115  0.9121798 -0.5955687 -0.67325561 -0.8788401  0.3554817
## 3  0.6904986 -0.7599982  1.6596029  1.52941294  0.8057784 -0.6275266
## 4  0.9334996 -0.8995039 -0.6096828  0.01485481 -0.3917541 -0.1262348
## 5 -0.1989968  0.2356027 -0.5732709 -0.60914944  0.1061891  0.3164212
##        lstat       medv
## 1  1.8708759 -1.3102002
## 2 -0.9544043  1.0995470
## 3  0.5406100 -0.4851454
## 4  0.6474932 -0.4618433
## 5 -0.1478389 -0.1012054
classes <- as.numeric(boston_scaled$cluster) 
plot(lda_5clusters, dimen = 2,col=classes,pch=classes,main="LDA for 5 k-means clusters") 
lda.arrows(lda_5clusters, myscale = 1.5) # playing with argument 'myscale' allows us to zoom in with the arrows, and see in which direction the less-clearly affecting variables seem to increase or decrease among towns

This biplot shows us that, when viewed in two dimensions, the towns/areas around Boston can be quite unambiguously divided into two main groups. The first group contains two non-overlapping clusters, 1 and 3 (black and green, respectively), and the second group contains clusters 2, 4 and 5 (red, blue and turquoise). In the latter group, the clusters are somewhat separated along the second axis of the two-dimensional distance plot, but there is some overlap (red towns among the turquoise and blue towns). From this figure, it looks like 5 clusters might be too many (and noise is starting to affect the clusters), but we need to remember that we see only two dimensions of the data.

The arrows of the biplot tell us that accessibility to radial highways (rad) is the variable that separates the black (6% of areas, shown by the prior probabilities in the printout of the model above) and green clusters (24% of areas) from the others. The second direction of variability (separating the dark blue from the red cluster) is composed mainly of the variable indus (proportion of non-retail business acres, i.e., area zoned for industry per town) and, to a smaller extent, of the variable nox.

So, based on this analysis, clusters 1 (black) and 3 (green) represent areas with a high index of accessibility to the radial highways (their standardized rad means are clearly positive in the table above). Comparing the group means for each cluster, the towns in black cluster 1 have the highest crime rate, quite small apartments (low rm, rooms per dwelling) and a low value of owner-occupied homes (medv) compared to the towns in the green cluster.

Of the towns with a lower index of accessibility to the radial highways, the towns in blue cluster 4 clearly rely on industry, which is partially reflected in their lower air quality (high nox). The turquoise (cluster 5) and red (cluster 2) clusters are mainly separated by the larger proportion of land zoned for large residential lots (zn) and the higher value of owner-occupied homes in the towns belonging to the red cluster.

4.9 Super-bonus: going beyond two dimensions

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling) 
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
str(matrix_product)
## 'data.frame':    404 obs. of  3 variables:
##  $ LD1: num  5.98 -1.12 5.84 -1.15 -2.76 ...
##  $ LD2: num  -0.185 -1.676 0.262 -0.302 2.691 ...
##  $ LD3: num  -0.6908 -0.0675 -0.157 -0.4263 -1.8714 ...
matrix_product$crime <- as.factor(train$crime) # add crime to the matrix product

#3D plot for crime classes
require(plotly)
plot_ly(matrix_product,x = ~LD1, y = ~LD2, z = ~LD3, color=~crime, colors=c('black','red','green','blue'),type='scatter3d',mode='markers')

This 3D plot corresponds to the 2-dimensional LDA for crime rate we drew in section 4.5, except that it also shows the variation along the third linear discriminant, LD3. By rotating the 3D figure and zooming in or out, we can find a position in which it resembles the original 2D plot.

Let’s try to draw a similar figure with colors defined by 5 clusters we used for the k-means in the bonus step.

model_predictors1 <- dplyr::select(boston_scaled, -cluster)

# check the dimensions
dim(model_predictors1) # 506 x 14
## [1] 506  14
dim(lda_5clusters$scaling) # 14 x 4
## [1] 14  4
# matrix multiplication
matrix_product1 <- as.matrix(model_predictors1) %*% lda_5clusters$scaling
matrix_product1 <- as.data.frame(matrix_product1)
str(matrix_product1) # 506 x 4 variables
## 'data.frame':    506 obs. of  4 variables:
##  $ LD1: num  7.01 6.09 6.32 5.99 6.03 ...
##  $ LD2: num  -0.796 0.36 -0.524 -2.203 -2.337 ...
##  $ LD3: num  -0.44373 -1.38016 0.00296 -1.08547 -0.87988 ...
##  $ LD4: num  -0.181 -0.693 -0.253 -0.594 -0.21 ...
matrix_product1$cluster <- as.factor(boston_scaled$cluster) # add cluster to the matrix product

#3D plot for clusters
plot_ly(matrix_product1,x = ~LD1, y = ~LD2, z = ~LD3, color=~cluster, colors=c('black','red','green','blue','turquoise'),type='scatter3d',mode='markers')

This plot can be turned into a position resembling the 2D-plot we did for clustering of towns in step 4.8 (Bonus).

If we turn the plots produced in this section until the axes point in the same directions, we can see that in the crime rate 3D plot there is a big diffuse cluster of towns marked with black, red and green (low, med_low and med_high crime) on the left, and a smaller group of blue towns (high crime) on the right. In the 5-cluster 3D plot, LD1 similarly separates the towns into a big diffuse group (clusters 2, 4 and 5, marked with blue, turquoise and red) and a smaller distinct group (clusters 1 and 3, marked with green and black), even if the colors are different and the axis LD1 has been mirrored.

lda.fit$scaling # effect of variables on crime rate based on 404 observations
##                 LD1         LD2         LD3
## zn       0.09009011  0.75205264 -1.05517919
## indus    0.01577899 -0.21600918  0.16476355
## chas    -0.07981356 -0.17947020  0.06668779
## nox      0.41795365 -0.77887703 -1.33797506
## rm      -0.08622754 -0.08035791 -0.12004300
## age      0.20258311 -0.21218725 -0.08566340
## dis     -0.07942412 -0.29446801  0.20696085
## rad      2.97878486  1.06347334 -0.18612299
## tax     -0.01683829 -0.19116738  0.76585945
## ptratio  0.14283373  0.02917238 -0.34426491
## black   -0.12164657  0.01476643  0.10981425
## lstat    0.22805678 -0.19590193  0.33237077
## medv     0.18728380 -0.35095028 -0.23777618
lda_5clusters$scaling # effect of variables on 5-fold clustering based on 506 observations
##                 LD1         LD2          LD3         LD4
## crim    -0.28647981  0.05899077 -0.586922723  1.52650170
## zn      -0.33551566 -0.16095033  1.105885851  0.56806332
## indus   -0.52403039  2.03420936  1.252109510  0.09882186
## chas     0.05216311 -0.03235858 -0.135918635 -0.08589036
## nox     -0.03382741  0.54599205  0.738549264  0.09957505
## rm       0.13185191 -0.43428102  0.683713433 -0.19346167
## age      0.02984712  0.27272862 -0.109183447 -0.13317971
## dis      0.18458605  0.09487783  0.302292114  0.01142951
## rad     -5.90338447 -1.90213948 -0.068082162 -1.03092682
## tax     -0.13642105 -0.09829639  0.204678148  0.05933799
## ptratio -0.19707552 -0.09916193 -0.200284491 -0.12465713
## black    0.04995340 -0.07452601 -0.004099398 -0.10394595
## lstat   -0.03611606 -0.14379281 -0.053621469  0.69738701
## medv     0.05414554 -0.24307788  0.373552378  0.74459324

Checking the coefficients of the two LDA models reveals that LD1 in the crime rate model (lda.fit) was mostly affected by rad (its coefficient, about 3.0, is clearly larger than that of any other variable in column LD1). Similarly, rad was the main variable affecting axis LD1 in the 5-cluster model, but there the coefficient is negative (-5.9), which could explain why the LD1 axes of the two models seem to mirror each other. This supports the idea that the towns marked with high crime (dark blue in the earlier plot) are mostly the same towns as those in clusters 1 and 3 (black and green in the latter plot).
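The “dominant variable per discriminant” reading done by eye above can also be done programmatically, by picking the row with the largest absolute coefficient in each column of the scaling matrix. A sketch on a toy matrix mimicking the structure of lda()$scaling, with values loosely borrowed from the lda.fit printout:

```r
# Toy scaling matrix: rows = predictors, columns = linear discriminants.
scaling <- matrix(c( 0.09,  2.98,  0.42,    # LD1 coefficients for zn, rad, nox
                    -1.06, -0.19, -1.34),   # LD3 coefficients for zn, rad, nox
                  ncol = 2,
                  dimnames = list(c("zn", "rad", "nox"), c("LD1", "LD3")))

# For each discriminant, find the predictor with the largest |coefficient|.
dominant <- rownames(scaling)[apply(abs(scaling), 2, which.max)]
dominant
```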

Additional variation between the plots arises from the different sets of predictor variables and the different observations used in the two LDA analyses.
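The "pick the variable with the largest absolute coefficient on each axis" step can be automated. A minimal, self-contained sketch using MASS::lda on the built-in iris data (a toy stand-in, not the Boston data used above):

```r
library(MASS)

# Fit a toy LDA and report which variable dominates each discriminant axis
fit <- lda(Species ~ ., data = iris)
dominant <- apply(abs(fit$scaling), 2,
                  function(x) rownames(fit$scaling)[which.max(x)])
dominant  # named character vector, one entry per LD axis
```

Applying the same apply() call to lda.fit$scaling or lda_5clusters$scaling would pick out rad on LD1 in both models.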


5. Dimensionality reduction techniques

5.1 Dataset: introducing dataset ‘human’

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
str(human) # 155 obs on 8 (numerical) variables
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
head(human,n=3) # check the first 3 rows to see that rownames are correctly specified
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297    17.5     81.6 64992       4       7.8
## Australia   0.9968288 0.8189415    20.2     82.4 42261       6      12.1
## Switzerland 0.9834369 0.8251001    15.8     83.0 56431       6       1.9
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5

The data consist of population-level indicators collected from 155 countries. Variables include population-wide estimates (based on datasets collected by the UN) of education level and labour-force participation (shown as female-to-male ratios), expected duration of education and life expectancy (in years), GNI (gross national income per capita), maternal mortality (deaths per 100,000 live births), adolescent birth rate (births per 1,000 women aged 15-19), and the proportion of females in parliament.

I did not manage to import the data from my own GitHub, so this work is based on the ready-made dataset provided by the course, and the variable names will differ from those in the create_human.R file. Since the dataset contains only numerical variables (countries appear only as row names), we can use ggpairs (from the package ‘GGally’) and corrplot (from the package ‘corrplot’) to visualise the distributions of and relationships between the variables in human.
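For future reference, files stored in a GitHub repository can usually be read directly via the raw-content URL. A purely hypothetical sketch (the username and file path below are placeholders, not a real file):

```r
# Hypothetical example only: <username> and the file path are placeholders
# human <- read.table(
#   paste0("https://raw.githubusercontent.com/<username>/",
#          "IODS-project/master/data/human.csv"),
#   sep = ",", header = TRUE)
```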

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

corrplot(cor(human),method = "circle",type="upper",cl.pos = "b",tl.pos = "d",tl.cex = 0.6)

Description and interpretation

  • From summary() and the axis scales of ggpairs(), we can see that GNI is given on a very different scale (0-125,000) compared to the other variables (0-10 or 0-100).

  • Histograms on the diagonal of ggpairs() reveal both bi- and unimodal distributions: it looks like two different groups of countries could be separated in terms of at least the female-to-male education and work ratios (Edu2.FM and Labo.FM), life expectancy (Life.Exp) and maternal mortality (Mat.Mor).

  • The correlation plot (corrplot()) shows strong positive correlations between expected education (Edu.Exp) and life expectancy (Life.Exp), and between maternal mortality (Mat.Mor) and adolescent births (Ado.Birth). The strongest negative correlations are between maternal mortality and life expectancy, expected education, or the female-to-male education ratio. Adolescent birth rate is also quite negatively correlated with these variables.
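The strongest pairs can also be ranked programmatically rather than read off the plot. A hedged sketch with a generic helper, demonstrated here on the built-in mtcars data; running top_cor(human) would rank the pairs discussed above:

```r
# Rank the strongest pairwise correlations in a numeric data frame
top_cor <- function(df, n = 5) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  idx <- arrayInd(ord[seq_len(n)], dim(cm))     # linear index -> row/col
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[ord[seq_len(n)]])
}
top_cor(mtcars[, c("mpg", "disp", "hp", "wt")], n = 3)
```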

5.2 PCA on non-standardized data

To see whether we can effectively reduce the dimensions of the (8-dimensional) human data, we will perform principal component analysis (PCA).

pca_human_nonstand <- prcomp(human) # perform PCA based on singular value decomposition (SVD)
s_nonstand <- summary(pca_human_nonstand) # save summary (containing 3 variables describing `importance` of 8 first principal component axes)
pca_pr <- round(100*s_nonstand$importance[2,],digits = 2) # calculate percentage (%) of variance contained within each principal component axis
pca_pr # print table
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
biplot(pca_human_nonstand,choices=1:2,cex=c(0.7,0.9),col=c("grey40","red"),expand=0.8) # draw a biplot, with expand set to 0.8 to keep the arrow labels inside the frame

Plot 1. PCA performed on the non-standardized human data separates countries mainly by economic status (GNI, gross national income per capita), and provides little other information.

The PCA result shows that the variance of the 8 (non-standardized) variables can be compressed almost completely onto PC1 (which contains >99% of the variability in the data), and according to the biplot PC1 consists almost solely of GNI, the variable with the highest absolute values.
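The dominance of GNI follows directly from its variance being orders of magnitude larger than that of the other variables (compare sapply(human, var)). A self-contained toy example of the same phenomenon:

```r
set.seed(1)
# Two unit-scale variables plus one with a huge scale, mimicking GNI
toy <- data.frame(a   = rnorm(100),
                  b   = rnorm(100),
                  big = rnorm(100, sd = 1000))
p <- prcomp(toy)               # PCA without standardization
summary(p)$importance[2, ]     # PC1 swallows essentially all the variance
round(p$rotation[, "PC1"], 2)  # and is almost purely 'big'
```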

5.3 PCA on standardized data

To give each variable equal weight (regardless of its absolute scale), we will repeat the PCA on the human data after standardization.

human_std <- scale(human) #centers each variable to its mean and scales the standard deviation to 1

pca_human <- prcomp(human_std) # perform PCA on standardized data

s <- summary(pca_human) # save summary
pca_pr <- round(100*s$importance[2,],digits = 2) # calculate percentage (%) of variance contained within each principal component axis
pca_pr # print % variance explained by each PC-axis
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 53.61 16.24  9.57  7.58  5.48  3.60  2.63  1.30

biplot(pca_human,choices=1:2,cex=c(0.6,0.9),col=c("grey60","red"),expand=0.8) # draw a biplot, with expand set to 0.8 to keep the arrow labels inside the frame

Plot 2. After standardization, PCA reveals a continuum from wealthy countries (with high GNI, education level and life expectancy) to poor countries (with high maternal mortality and adolescent birth rate). Gender equality in the adult population (measured by the ability of women to work or participate in politics, PC2) is unrelated to the relative wealth of the countries.

Comparison of non-standardized to standardized PCA, and interpretation

  • Comparing the tables of variance explained (%) per PCA dimension, it is clear that standardizing the data allows more nuance to show (not all variance ends up on the first axis). In the standardized data, PC1 contained only about half of the total variance, and the second axis contained over 16%.

  • In the biplots, the most obvious difference is in the length of the arrows describing which variables contribute most to PC1 and PC2. In the non-standardized case (plot 1), only the effect of GNI (negatively associated with PC1) can be discerned.

  • In the standardized case (plot 2), the biplot shows that countries located far apart along PC1 (e.g., Australia and Chad) differ in GNI, life expectancy, expected education and female-to-male education ratio (all high in countries plotted on the left) versus maternal mortality and adolescent birth rate (high in countries plotted on the right). PC2 separates countries such as Rwanda and Iran, and the direction of the arrows shows that the proportions of women in the working population and in parliament are higher in countries with high coordinates along PC2 (plotted toward the top of the biplot).

The two plots clearly show the effect of standardizing the data: in the non-standardized data, GNI (the variable with the highest absolute variability) dominated the PCA results and masked the potential explanatory power of all the other variables.

5.4 Reduced dimensions: wealth and gender equality (in adults)

Based on the reasoning below, I would suggest the following interpretations for the axes of the PCA on standardized data:
PC1 = wealth (economy, education and life expectancy vs. low reproductive equality) (on an inverse scale)
PC2 = gender equality in the adult population

  • PC1: The directions of the arrows in plot 2 (standardized case) indicate that the economic status of a country (measured by GNI) is strongly associated with high (or at least long) education and life expectancy, and does not co-occur with high maternal mortality or adolescent birth rate. The effects of these variables cannot be separated from each other, and any further analysis of one of them should interpret them together. This axis seems to reflect the gradual shift from ‘first-world’ to ‘third-world’ countries.

  • PC2: Irrespective of relative wealth (or the variables correlated with PC1), countries can be separated by the activity of females in work or politics. This could reflect differences in the equality of the society (how easy it is for women to enter or remain in working life) or in family support systems (is the responsibility for childcare divided equally between parents, and is daycare easy to find when both parents work). The orthogonality of these axes suggests that gender equality (of adult females) is unrelated to the relative wealth of the countries.

  • Although gender equality (in the adult population) seems to be captured mainly by PC2, this interpretation does not hold for young/reproducing women (measured as female education, adolescent birth rate or maternal mortality). The highest female education is achieved in the wealthiest countries (low scores on PC1) and is strongly negatively related to adolescent birth rate (girls who continue studying are less likely to have children very early) and to maternal mortality (also probably related to the available healthcare and/or wealth of the country). The high adolescent birth rate in less wealthy countries could also be related to lower life expectancy: if the probability of both parents surviving to old age is lower, from a family-planning perspective it does not make sense to wait too long before starting a family.

6. Analysis of longitudinal data

6.1 RATS data: effect of diet on rat weight

RATSL <- read.table("D:/Opiskelu/MOOC-kurssi/IODS-project/data/RATSL.txt",header=TRUE) 
RATSL$ID <- factor(RATSL$ID) # code categorical variables (ID and Group) as factors
RATSL$Group <- factor(RATSL$Group)

glimpse(RATSL) # check structure and first few values of data
## Observations: 176
## Variables: 4
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
## $ weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
str(RATSL) # str() shows also number of levels for factor variables
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
summary(RATSL)
##        ID      Group       Time           weight     
##  1      : 11   1:88   Min.   : 1.00   Min.   :225.0  
##  2      : 11   2:44   1st Qu.:15.00   1st Qu.:267.0  
##  3      : 11   3:44   Median :36.00   Median :344.5  
##  4      : 11          Mean   :33.55   Mean   :384.5  
##  5      : 11          3rd Qu.:50.00   3rd Qu.:511.2  
##  6      : 11          Max.   :64.00   Max.   :628.0  
##  (Other):110

The dataset contains the body weights (weight, in grams) of 16 rats fed three different diets (Group). The weight of each rat (identified by ID) was measured on 11 occasions (Time, in days since the start).
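The long-format RATSL was created from the wide RATS data in the data-wrangling exercise. A toy sketch of that conversion (the WD1, WD8, … column names are assumptions based on the wide data used later in section 6.1c):

```r
library(tidyr)
library(dplyr)

# Toy wide data mimicking RATS: one WD<day> column per measurement day
rats_wide <- data.frame(ID = factor(1:2), Group = factor(c(1, 2)),
                        WD1 = c(240, 410), WD8 = c(250, 415))

# Gather the WD columns into one row per rat x measurement day
rats_long <- rats_wide %>%
  pivot_longer(starts_with("WD"), names_to = "WD", values_to = "weight") %>%
  mutate(Time = as.integer(sub("WD", "", WD))) %>%  # extract the day number
  select(-WD)
rats_long
```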

6.1a graphs

To start with, let’s plot weight development over time for individual rats, separately for rats fed with different diets.

ggplot(RATSL, aes(x = Time, y = weight, linetype = ID)) +
  geom_line() +
  theme_bw() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$weight), max(RATSL$weight)),name="Weight (grams)") +
  scale_x_continuous(name="Time (days)") +
  ggtitle("Weight of rat individuals over time")

On all diets (shown in different panels), rat weight seems to increase over time, but rats receiving diet 2 or 3 were substantially larger (already at the start of the experiment) than rats receiving diet 1. From this plot it is difficult to say whether the growth rate (steepness of the increase in body weight) differs between diets, but there is definitely some individual variation in growth rates (e.g., one individual in group 2 grows more slowly than all the other rats) and in rat sizes (besides the dense bundle of lines within each panel, each group has an outlier individual that is smaller or larger than most rats in that treatment).

It might be possible to see individual rat responses more clearly if the weights were standardized within each time point and group.

RATSL <- RATSL %>%
  group_by(Time,Group) %>%
  mutate(stdweight = (weight - mean(weight))/sd(weight)) %>%
  ungroup() # add a column stdweight to the dataset. For each observation, this column contains the difference to mean weight observed at that time in that group, scaled by dividing with standard deviation within that time and group.
str(RATSL)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  5 variables:
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ stdweight: num  -0.698 -1.683 -0.37 0.616 0.287 ...
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  theme_bw() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$stdweight), max(RATSL$stdweight)),name="standardized weight") +
  scale_x_continuous(name="Time (days)") +
  ggtitle("Difference to mean weight of individual rats over time")

Now we no longer see the differences between groups or the average (increasing) trend in rat body weight over time, but rats whose growth profiles deviate from the average stand out clearly. E.g., the difference between the smaller-than-average rat in group 1 (dashed line at the bottom) and the group mean increases over time (the line drops farther from the mean value 0). Keeping this rat in group 1 will probably affect the estimated group responses: by increasing “noise” in the data, it might decrease the chance that the model shows significant differences. But without knowing more about this particular rat (Was it considerably more sickly or stressed than the other rats in its group? Did it eat less, or was the difference in trends caused solely by growth efficiency?), I would not dare to remove it as an outlier.

Now we can move from looking at individual rats to summaries of the group responses (mean and standard error of the mean).

n <- RATSL$Time %>% unique() %>% length() #makes a vector of unique values of Time, and counts the length of that vector (n = number of measurement days)
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(n)) %>% ungroup() #create summary data containing mean and SE for each diet group and time point
glimpse(RATSS) # a data matrix of 33 rows (11 time points x 3 groups) and 4 variables
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group, color = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=1.5) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme_bw() +
  theme(legend.position = "right") +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")+
  ggtitle("Mean rat weight on different diets over time")

From this plot it looks clear that rat weight on diet 1 is lower than on diets 2 and 3. The standard errors (error bars) of groups 2 and 3 do not overlap, which might suggest a difference between these groups as well. However, since this difference was already present on day 1, the difference in group-specific mean weights does not tell us whether the different diets supported similar growth.

6.1b Rat body weight differs between diets

To study whether rat body size differs between diets (for now ignoring the fact that repeated measurements from the same rats are not independent), we can run an ANOVA on body weight, using diet group as the grouping factor.

RATSS1 <- RATSL %>%
  group_by(ID,Group) %>%
  summarise(weight_mean=mean(weight)) %>%
  ungroup() # calculate mean weight of each individual rat in each group over 11 sampling dates
glimpse(RATSS1) # 16 rows (rats) belonging to 3 diet groups
## Observations: 16
## Variables: 3
## $ ID          <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ weight_mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, ...
ggplot(RATSS1, aes(x = Group, y = weight_mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), days 1-65") +
  scale_x_discrete(name = "Diet") +
  theme_bw() +
  ggtitle("Mean weight of rats on different diets")

w1 <- lm(weight_mean ~ Group,data=RATSS1)
anova(w1)
## Analysis of Variance Table
## 
## Response: weight_mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 236732  118366  88.073 2.763e-08 ***
## Residuals 13  17471    1344                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(w1)
## 
## Call:
## lm(formula = weight_mean ~ Group, data = RATSS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.886 -27.031   2.284  10.588 105.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   263.72      12.96  20.346 3.06e-11 ***
## Group2        220.99      22.45   9.844 2.16e-07 ***
## Group3        262.08      22.45  11.674 2.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.66 on 13 degrees of freedom
## Multiple R-squared:  0.9313, Adjusted R-squared:  0.9207 
## F-statistic: 88.07 on 2 and 13 DF,  p-value: 2.763e-08

The F-test performed by anova() tells us that diet affects rat weight (P<0.001), and the statistics for the model coefficients show that rats on diets 2 and 3 have clearly higher body weight than rats on diet 1 (P<0.001 for both), although the boxplot shows that the distributions of observations in groups 2 and 3 are not normal (but, without a good reason to remove those individuals as outliers, I would not do it).
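As a follow-up (not part of the original analysis), pairwise Tukey comparisons would show which diet groups differ from which; with the real data this would be TukeyHSD(aov(weight_mean ~ Group, data = RATSS1)). A self-contained sketch on simulated data with a similar structure:

```r
set.seed(42)
# Simulated stand-in for RATSS1: 3 diet groups with different mean weights
toy <- data.frame(Group       = factor(rep(1:3, each = 5)),
                  weight_mean = c(rnorm(5, 260, 15), rnorm(5, 485, 15),
                                  rnorm(5, 525, 15)))
TukeyHSD(aov(weight_mean ~ Group, data = toy))  # all pairwise group contrasts
```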

6.1c … but the effect is not significant when initial weight is included

But what happens if we take into account the different initial body size of rats in these groups?

RATS <- read.table("D:/Opiskelu/MOOC-kurssi/IODS-project/data/RATS.txt",header=TRUE) #importing the wide-format of data
RATSS1 <- RATSS1 %>%
  mutate(baseline = RATS$WD1)
glimpse(RATSS1)
## Observations: 16
## Variables: 4
## $ ID          <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ weight_mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, ...
## $ baseline    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, ...
w2 <- lm(weight_mean ~ baseline + Group,data=RATSS1)
anova(w1,w2) # LR-test: when anova() is used on two nested linear models, it checks whether the fit of the more complex model is better than the simpler one.
## Analysis of Variance Table
## 
## Model 1: weight_mean ~ Group
## Model 2: weight_mean ~ baseline + Group
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 17471.4                                  
## 2     12  1352.4  1     16119 143.02 5.023e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(w2) #F-test
## Analysis of Variance Table
## 
## Response: weight_mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(w2) #t-tests for model coefficients
## 
## Call:
## lm(formula = weight_mean ~ baseline + Group, data = RATSS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.732  -3.812   1.991   6.889  13.455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.14886   19.88779   1.516   0.1554    
## baseline     0.93194    0.07793  11.959 5.02e-08 ***
## Group2      31.68866   17.11189   1.852   0.0888 .  
## Group3      21.52296   21.13931   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9933 
## F-statistic: 747.8 on 3 and 12 DF,  p-value: 6.636e-14

6.1d Interpretation of results

  • The likelihood ratio (LR) test indicates that including the baseline in the weight model significantly increased the model fit, so the more complex model w2 is better and should be used for interpretation.

  • The F-test for this model tells us that there is a trend toward differences between diet groups (P=0.076), but the effect (at least without removing outliers) is not clear enough to be considered statistically significant. The summary t-tests support the view that diet 2 supports the highest mean weight (when corrected for initial body weight), but the possibility that this is due to chance cannot be excluded based on this analysis and data.

  • A highly significant F-test for the baseline (P<0.001), and a positive estimate for the baseline effect in the t-tests (P<0.001), show that the (initially) largest rats also have the highest mean weight measured over the experiment.

To conclude, we did not find sufficient evidence to state that diet affects rat body weight.

This analysis ignores the potentially unequal intervals between measurements, as well as the individual differences in rat growth profiles. Since we used the mean over time as the (summary) response variable, the assumption of independence was not violated, but we discarded quite a bit of information compared to a mixed-model approach, where the within-subject (within-rat) variability could also have been modelled.
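The mixed-model alternative mentioned above could be sketched as follows (hedged: this model is not fitted in this report; it assumes RATSL as loaded above and the lme4 package used in section 6.2):

```r
library(lme4)

# Random intercept and slope for each rat, fixed effects of time and diet;
# analogous to the BPRS models fitted in section 6.2
rat_mm <- lmer(weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
summary(rat_mm)
```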

6.2 BPRS data: schizophrenia symptoms in two treatment groups

BPRSL <- read.table("D:/Opiskelu/MOOC-kurssi/IODS-project/data/BPRSL.txt",header=TRUE)
BPRSL$treatment <- factor(BPRSL$treatment) # code categorical variables (treatment and subject) as factors
BPRSL$subject <- factor(BPRSL$subject)

glimpse(BPRSL) # check structure and first few values of data
## Observations: 360
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
str(BPRSL) # str() shows also number of levels for factor variables
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
summary(BPRSL)
##  treatment    subject         Week        bprs      
##  1:180     1      : 18   Min.   :0   Min.   :18.00  
##  2:180     2      : 18   1st Qu.:2   1st Qu.:27.00  
##            3      : 18   Median :4   Median :35.00  
##            4      : 18   Mean   :4   Mean   :37.66  
##            5      : 18   3rd Qu.:6   3rd Qu.:43.00  
##            6      : 18   Max.   :8   Max.   :95.00  
##            (Other):252

The data consist of the bprs score (a summary measure of the severity of schizophrenia symptoms as rated by a healthcare professional, with higher values indicating more symptoms) of 40 patients, 20 per treatment group (note that subject numbers 1-20 are reused in both groups, so subject alone does not uniquely identify a patient). The bprs score was recorded at baseline (week 0) and then weekly for 8 weeks; from week 1 onward, the patients received one of two alternative treatments.

6.2a graphs

Let’s start again by plotting the development of symptoms in 20 patients over time.

ggplot(BPRSL, aes(x = Week, y = bprs, group = interaction(subject,treatment))) +
  geom_point(aes(shape = treatment,colour=treatment)) +
  geom_line(aes(linetype=treatment,colour=treatment)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme_bw() +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "none") +
  ggtitle("Symptom development in individuals over time")

This figure is not hugely informative, but it shows that there is considerable variation in the symptom profiles of individuals over time. The severity of symptoms (bprs) tends to decrease over time in both groups, and overall symptoms might be slightly lower under treatment 1 than treatment 2, but it is impossible to say with any certainty from this figure.

We can also plot the repeated measurements from each individual (using the wide-format data) to check whether there is a correlation structure in the data.

BPRS <- read.table("D:/Opiskelu/MOOC-kurssi/IODS-project/data/BPRS.txt",header=TRUE) #same data in wide format
glimpse(BPRS)
## Observations: 40
## Variables: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, ...
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, ...
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, ...
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, ...
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, ...
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, ...
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, ...
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, ...
pairs(BPRS[,3:11])

It looks like the bprs scores from closely spaced weeks (near the diagonal) at least are fairly strongly correlated. This is one example of a correlation/covariance structure in the data that could, and should, be modelled.
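The impression from pairs() can be quantified by averaging the correlations at each lag between weeks (a sketch assuming the wide BPRS data loaded above):

```r
cm <- cor(BPRS[, 3:11])                       # 9 weekly bprs columns
# Average correlation between weeks that are 1, 2, ..., 8 weeks apart
mean_r_by_lag <- sapply(1:8, function(lag) {
  mean(cm[cbind(1:(9 - lag), (1 + lag):9)])   # entries on the lag-th diagonal
})
round(mean_r_by_lag, 2)  # correlations typically decay as the lag grows
```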

6.2b selecting the best fitting linear mixed effect model

Our final aim is to explore whether there are interactive or main effects of week and treatment in the data, when individual patients are allowed to
a) have overall different symptom levels (random intercept), or
b) differ in both overall symptom level and slope (tendency to change over time) (random intercept and slope).

We will also fit a couple of simpler (non-mixed) linear models, and use LR-tests to investigate whether the more complex model structures are required to model the variability in the data as parsimoniously as possible.

bprs1 <- lm(bprs~Week+treatment,data=BPRSL) # main effect model, no interactive term between week and treatment
bprs2 <- lm(bprs~Week*treatment,data=BPRSL) # main + interactive effects of week and treatment

anova(bprs1,bprs2) #LR-test: is the `week x treatment` interaction required for the most parsimonious model?
## Analysis of Variance Table
## 
## Model 1: bprs ~ Week + treatment
## Model 2: bprs ~ Week * treatment
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    357 54584                           
## 2    356 54277  1    307.45 2.0166 0.1565
# - no (P>0.1), at least if the model does not include random parts

bprs3 <- lme4::lmer(bprs~Week*treatment+(1|subject),data=BPRSL,REML=FALSE) # random intercept
bprs4 <- lme4::lmer(bprs~Week*treatment+(Week|subject),data=BPRSL,REML=FALSE) # random intercept and slope

anova(bprs3,bprs4) #LR-test: is a random slope required for the most parsimonious model?
## Data: BPRSL
## Models:
## bprs3: bprs ~ Week * treatment + (1 | subject)
## bprs4: bprs ~ Week * treatment + (Week | subject)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs3  6 2747.8 2771.1 -1367.9   2735.8                           
## bprs4  8 2744.3 2775.4 -1364.1   2728.3 7.4802      2    0.02375 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# - yes (P=0.024)

bprs5 <- lme4::lmer(bprs~Week+treatment+(Week|subject),data=BPRSL,REML=FALSE) # random intercept and slope, no interaction between main effects?
anova(bprs4,bprs5) #LR-test: is the `week x treatment` interaction required for the most parsimonious model, when the model includes a random intercept and slope?
## Data: BPRSL
## Models:
## bprs5: bprs ~ Week + treatment + (Week | subject)
## bprs4: bprs ~ Week * treatment + (Week | subject)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs5  7 2745.4 2772.6 -1365.7   2731.4                           
## bprs4  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# - maybe (P=0.075)

Sometimes LR-tests can be a quite useful tool for selecting which variables (or interactions) to keep in the final model. This time, however, the result was not very definite. Even though the LR-tests clearly indicated that a model with a random intercept and slope fitted the data more closely than the model with a random intercept only, it was not clear from the test result whether including the (fixed) interaction between time and treatment would improve the model fit. Sometimes this can be decided by going back to the study questions: if we expect an interactive effect (e.g., that one of the treatments might start reducing symptoms earlier), this term should not be removed from the final model.
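As a compact complement to the sequential LR-tests, the candidate models can also be compared on AIC in a single call (a sketch assuming the fitted objects bprs3, bprs4 and bprs5 from above):

```r
# Smaller AIC is better; in the anova() output above, bprs4 (2744.3)
# had the lowest AIC of the three mixed models
AIC(bprs3, bprs4, bprs5)
```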

6.2c interpretation of results

In this case, we will look at the results of both models, bprs4 and bprs5, to see whether the choice matters for the interpretation. (In real research I would consider this risky: if you check the results before deciding on the final model, you might be tempted to pick the results you like instead of the model that reflects the data better.) We will re-fit the final models using the package lmerTest, whose summary() also prints tests for the model coefficients.

bprs4 <- lmerTest::lmer(bprs~Week*treatment+(Week|subject),data=BPRSL,REML=FALSE) # Week x treatment interaction included
anova(bprs4)
## Type III Analysis of Variance Table with Satterthwaite's method
##                Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Week           5610.8  5610.8     1    20 58.1603 2.421e-07 ***
## treatment       138.9   138.9     1   320  1.4403   0.23097    
## Week:treatment  307.5   307.5     1   320  3.1870   0.07517 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(bprs4)
## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ Week * treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           Week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.2521  29.6312  21.262  < 2e-16 ***
## Week             -2.6283     0.3589  41.7201  -7.323 5.24e-09 ***
## treatment2       -2.2911     1.9090 319.9977  -1.200   0.2310    
## Week:treatment2   0.7158     0.4010 319.9977   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Week   trtmn2
## Week        -0.650              
## treatment2  -0.424  0.469       
## Wek:trtmnt2  0.356 -0.559 -0.840
bprs5 <- lmerTest::lmer(bprs~Week+treatment+(Week|subject),data=BPRSL,REML=FALSE) # Week x treatment interaction not included
anova(bprs5)
## Type III Analysis of Variance Table with Satterthwaite's method
##           Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Week      5665.8  5665.8     1    20 58.1525 2.425e-07 ***
## treatment   29.5    29.5     1   320  0.3025    0.5827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(bprs5)
## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ Week + treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           Week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.1052  22.6796  22.066  < 2e-16 ***
## Week         -2.2704     0.2977  19.9985  -7.626 2.42e-07 ***
## treatment2    0.5722     1.0405 320.0007   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) Week  
## Week       -0.582       
## treatment2 -0.247  0.000

In this case, the interpretation did not change between the models with different fixed-effect structures (the effect related to the interaction term in model bprs4 was not significant based on an F-test, P = 0.075). The interpretation of the results from both models is that schizophrenia symptoms decreased over time (negative estimate for Week), but there was no difference between the treatments, at least in this dataset.

We can finish by plotting the fitted values from the more complex model and comparing them to the observed values.

ggplot(BPRSL, aes(x = Week, y = bprs, group = interaction(subject,treatment))) +
  geom_point(aes(shape = treatment,colour=treatment)) +
  geom_line(aes(linetype=treatment,colour=treatment)) +
  theme_bw() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "bottom") +
  ggtitle("Observed symptom development")

Fitted <- fitted(bprs4)                     # fitted values from the interaction model
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)  # add them as a new column for plotting
ggplot(BPRSL, aes(x = Week, y = Fitted, group = interaction(subject,treatment))) +
  geom_point(aes(shape = treatment,colour=treatment)) +
  geom_line(aes(linetype=treatment,colour=treatment)) +
  theme_bw() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "bottom") +
  ggtitle("Fitted symptom development")

The trends visible in the fitted values resemble the observed values to some extent but do not capture the variability inherent in the first figure. To conclude, we could say that both treatments are equally effective in reducing the symptoms, although it might be worthwhile to study the individual variability (with techniques and plots similar to those used for the RATS data) and consider the possibility of outliers, as well as spend some time checking model assumptions before committing to the conclusion.
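The assumption checks mentioned above could look roughly like this. The sketch below uses a plain lm() fit on simulated data so it is self-contained (the fit object and data frame are made up for illustration); for the mixed model, the same idea applies to resid(bprs4), and additionally the random effects from ranef(bprs4) could be checked for approximate normality.

``` r
# Minimal sketch of residual diagnostics: residuals vs fitted values
# (constant variance, no pattern) and a normal QQ plot (normality).
set.seed(2)
d <- data.frame(x = runif(100, 0, 8))
d$y <- 50 - 2.5 * d$x + rnorm(100, sd = 10)
fit <- lm(y ~ x, data = d)

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), main = "Residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))
```

Clear funnel shapes in the left panel or strong tail departures in the QQ plot would suggest the model's error assumptions are questionable.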



  1. For a full list of data attributes and their abbreviations, see https://archive.ics.uci.edu/ml/datasets/Student+Performance

  2. For a full list of data attributes, see https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html